Abstract. This article reviews the Author-Topic model and presents a new
non-parametric extension based on the Hierarchical Dirichlet Process. The
extension is especially suitable when no prior information about the number
of components necessary is available. A blocked Gibbs sampler is described,
with a focus on staying as close as possible to the original model and
incurring only the minimum necessary theoretical and implementation overhead.
1 Introduction

Probabilistic models that infer the interests of authors have attracted much
attention throughout the language modeling community, with the Author-Topic
model [4] as one of their seminal representatives. Multiple modifications to
the Author-Topic model have been proposed. These modifications either assume a
fixed number of topics or use authorship information as an additional feature
in a non-parametric setting that bears little resemblance to the structure of
the original work. This article addresses a complementary problem:
representing the Author-Topic model in the framework of Bayesian
non-parametrics while keeping as much as possible of its original structure.
While this might be valuable in its own right, it is also useful in a more
general sense, since the steps necessary to transform an extension of Latent
Dirichlet Allocation (LDA) with a fixed number of parameters into an
equivalent model that grows the number of parameters with the amount of data
available apply to a broad range of models.

2 Generative models for documents and authors 

Fig. 1. Admixture models for documents and authors: (a) the Author-Topic model,
(b) the non-parametric Author-Topic model (this paper).

We will describe two different models: the first one relates authors and
documents via a fixed number of topics, and the second one models the
interests of authors using a flexible number of topics. Both models are
described using the common notation of a document d being a vector of $N_d$
words, $\mathbf{w}_d$, where all $w_{di}$ are chosen from a vocabulary with V
terms, and $\mathbf{j}_d$ are the authors of document d, chosen from the set
of all authors of size J. A corpus of D documents is then defined by the set
$\{(\mathbf{w}_1, \mathbf{j}_1), \ldots, (\mathbf{w}_D, \mathbf{j}_D)\}$.
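The corpus notation can be made concrete with a small data structure. The following is a minimal sketch, not part of the original model description: the names `Document` and `check_corpus` are hypothetical, and terms and authors are assumed to be encoded as zero-based integer ids.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    words: List[int]    # w_d: term ids, each in 0..V-1 (N_d = len(words))
    authors: List[int]  # j_d: author ids, each in 0..J-1


def check_corpus(corpus: List[Document], V: int, J: int) -> int:
    """Validate the id ranges and return the total number of word tokens."""
    total = 0
    for doc in corpus:
        if not doc.authors:
            raise ValueError("every document needs at least one author")
        if any(not 0 <= w < V for w in doc.words):
            raise ValueError("term id outside the vocabulary")
        if any(not 0 <= j < J for j in doc.authors):
            raise ValueError("author id outside the author set")
        total += len(doc.words)
    return total


# A toy corpus with D = 2 documents, V = 3 terms and J = 2 authors.
corpus = [
    Document(words=[0, 2, 2, 1], authors=[0, 1]),
    Document(words=[1, 1], authors=[1]),
]
```

Each pair $(\mathbf{w}_d, \mathbf{j}_d)$ corresponds to one `Document` instance, and the corpus is simply the list of these pairs.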



2.1 The parametric model 

The seminal Author-Topic model [4] has two sets of unknown parameters, J
distributions $\theta_j$ over topics conditioned on the authors and K
distributions $\phi_k$ over terms conditioned on the topics, as well as the
assignments of individual words to authors $x_{di}$ and topics $z_{di}$. With
$\theta$ and $\phi$ integrated out, a collapsed Gibbs sampler analogous to [1]
is used to converge to the true underlying distributions of the Markov state
variables x and z. The transitions between the states of the chain result from
iteratively sampling each pair $(x_{di}, z_{di})$ as a block, conditioned on
all other variables:

$$p(z_{di} = k, x_{di} = j \mid w_{di} = t, \mathbf{w}_{-di}, \mathbf{j}_d, \cdot) \propto \frac{n_{jk}^{-di} + \alpha}{n_{j\cdot}^{-di} + K\alpha} \, f_k^{-w_{di}}(t) \qquad (1)$$

where $n_{jk}^{-di}$ is the number of times a word of topic k has been
assigned to author j, excluding the current instance, and $\cdot$ is used in
place of a variable to indicate that the sum over its values (e.g.
$n_{j\cdot} = \sum_k n_{jk}$) is taken. The assignment variable $z_{di} = k$
represents the topic of the i-th word in document d being k, while
$x_{di} = j$ represents its assignment to author j. The term on the right
side, $f_k^{-w_{di}}(t)$, is the posterior density of term t under topic k:

where $n_{kt}^{-di}$ is the number of times a term t has been assigned to
topic k, again excluding the current word from the count.
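The blocked update of equation (1) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name and argument layout are hypothetical, and the term density is assumed to take the usual collapsed form $(n_{kt}^{-di} + \beta) / (n_{k\cdot}^{-di} + V\beta)$ with a symmetric Dirichlet hyperparameter $\beta$, a symbol that does not appear in the text above.

```python
import random


def sample_author_topic(t, doc_authors, n_jk, n_kt, n_k, V, alpha, beta, rng=random):
    """Draw one (author j, topic k) pair for a word with term id t.

    All counts must already exclude the current word (the "-di" counts):
      n_jk[j][k]  words of topic k assigned to author j
      n_kt[k][t]  times term t has been assigned to topic k
      n_k[k]      total number of words assigned to topic k
    """
    K = len(n_k)
    pairs, weights = [], []
    for j in doc_authors:                 # only authors of this document
        n_j = sum(n_jk[j])                # n_{j.}: sum over topics
        for k in range(K):
            author_part = (n_jk[j][k] + alpha) / (n_j + K * alpha)
            # Assumed form of f_k(t): symmetric Dirichlet prior beta over terms.
            term_part = (n_kt[k][t] + beta) / (n_k[k] + V * beta)
            pairs.append((j, k))
            weights.append(author_part * term_part)
    # random.choices normalizes the weights, so proportionality is enough.
    return rng.choices(pairs, weights=weights, k=1)[0]
```

In a full sampler, the caller would decrement the counts for the current word, call this function, and then increment the counts for the newly drawn pair.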



2.2 The non-parametric model 

One frequently raised question when applying the Author-Topic model to a new
data set is how to choose the number of topics [5]. The Bayesian
non-parametric framework of the Hierarchical Dirichlet Process (HDP) [6]
offers an elegant solution by placing a prior over a countably infinite
number of topics, of which only a few will dominate the posterior. Building on
the finite version of the model, we split the symmetric prior $\alpha$ over
topics into a scalar precision parameter $\alpha$ and a distribution
$\tau \sim \mathrm{Dir}(\gamma/K)$. Taking this to the limit
$K \to \infty$, we get the root distribution for the non-parametric
Author-Topic model (Fig. 1b). Analogously to the collapsed Gibbs sampler for
the previous LDA-based version, we integrate over $\theta_j$ but keep $\tau$
as an auxiliary variable to preserve the structure of the state transition
probabilities of the finite case for the HDP [2].
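As a hedged sketch, and assuming the standard direct-assignment representation of the HDP, the state transition probabilities can be written as follows. Here $\tau$ denotes the root distribution and $\alpha$ the precision parameter; the exact form is an assumption based on that standard representation, chosen so that it mirrors equation (1).

```latex
p(z_{di} = k, x_{di} = j \mid w_{di} = t, \mathbf{w}_{-di}, \mathbf{j}_d, \cdot) \propto
\begin{cases}
\dfrac{n_{jk}^{-di} + \alpha \tau_k}{n_{j\cdot}^{-di} + \alpha} \, f_k^{-w_{di}}(t)
  & \text{for an existing topic } k \le K, \\[2ex]
\dfrac{\alpha \tau_{K+1}}{n_{j\cdot}^{-di} + \alpha} \, f_{k^{\mathrm{new}}}^{-w_{di}}(t)
  & \text{for a new topic } k = K + 1.
\end{cases}
```

Compared with equation (1), the term $\alpha$ in the numerator becomes $\alpha \tau_k$, and $K\alpha$ in the denominator becomes $\alpha$ because the components of $\tau$ sum to one.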
Here $f_{k^{\mathrm{new}}}^{-w_{di}}(t) = 1/V$ is the prior density of a word
w under a new topic [6]. The key difference between the non-parametric
sampling equations and the original model (1) is that we now have a root
distribution $\tau$ for the HDP over K+1 possible states. If there are K
topics in the current step, then $\tau_{K+1}$ represents the accumulated
continuous probability mass of all possible but currently unused topics, which
allows a new topic to be chosen from a countably infinite pool of empty
topics. If the number of words assigned to a topic drops to zero, the topic is
returned to the pool of unused topics.
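The role of the additional state K+1 can be illustrated with a short Python sketch. It is a sketch under assumptions, not the authors' code: the function name is hypothetical, the weights follow the standard direct-assignment form for the HDP, the term density for existing topics is assumed to use a symmetric Dirichlet hyperparameter `beta`, and the new-topic density is taken to be 1/V.

```python
import random


def sample_author_topic_hdp(t, doc_authors, n_jk, n_kt, n_k, tau, V, alpha, beta, rng=random):
    """Draw one (author j, topic k) pair; k == K means "open a new topic".

    tau has K + 1 entries: tau[0..K-1] for the topics in use and tau[K] for
    the accumulated mass of all unused topics. Counts exclude the current word.
    """
    K = len(tau) - 1
    pairs, weights = [], []
    for j in doc_authors:
        n_j = sum(n_jk[j])
        for k in range(K):
            # Existing topic: alpha * tau[k] replaces the symmetric prior alpha.
            author_part = (n_jk[j][k] + alpha * tau[k]) / (n_j + alpha)
            term_part = (n_kt[k][t] + beta) / (n_k[k] + V * beta)  # assumed form
            pairs.append((j, k))
            weights.append(author_part * term_part)
        # New topic: no counts yet, so only the prior mass and density 1/V remain.
        pairs.append((j, K))
        weights.append(alpha * tau[K] / (n_j + alpha) * (1.0 / V))
    return rng.choices(pairs, weights=weights, k=1)[0]
```

When the returned topic index equals K, the caller would append a new row to the count tables and extend `tau`; how the mass of `tau[K]` is split at that point is left open here and depends on the chosen HDP representation.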



2.3 Sampling the Root Distribution 

The construction of a Markov chain for the non-parametric Author-Topic model
additionally requires sampling the root distribution $\tau$ of the Dirichlet
processes, which was not present in the finite version of the model. The
discrete part of the root distribution guarantees that existing topics are
reused with probability $\sum_{k=1}^{K} \tau_k$, and the continuous part
allows a new topic to be sampled with probability $\tau_{K+1}$ [3]. Given the
Markov state, we begin by generating J vectors

$$\mu_{jkr} = \frac{\alpha \tau_k}{r - 1 + \alpha \tau_k}, \quad \text{with } r = 1, \ldots, n_{jk} \qquad (4)$$

where $n_{jk}$ is the number of words of author j that have been assigned to
topic k. Next, we draw Bernoulli random variables
$m_{jkr} \sim \mathrm{Bern}(\mu_{jkr})$. The posterior of the top-level
Dirichlet process $\tau$ is then sampled via

$$\tau \sim \mathrm{Dir}([m_1, \ldots, m_K], \gamma), \quad \text{with } m_k = \sum_{j,r} m_{jkr} \qquad (5)$$

making $\tau$ a discrete distribution over the K used topics plus one
component carrying the probability mass of the infinitely many possible, yet
unused, topics.
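The two steps of equations (4) and (5) can be sketched as follows. This is a minimal illustration under assumptions: the function name is hypothetical, every topic in use is assumed to have a strictly positive current weight in `tau`, and the Dirichlet draw is obtained by normalizing independent Gamma variates, which is a standard construction and not necessarily the one used by the authors.

```python
import random


def sample_root_distribution(n_jk, tau, alpha, gamma, rng=random):
    """Resample the root distribution; returns (new_tau, m).

    n_jk[j][k]  number of words of author j assigned to topic k
    tau         current root distribution with K + 1 entries, tau[k] > 0
    """
    K = len(tau) - 1
    m = [0] * K
    for counts in n_jk:                       # one pass per author j
        for k in range(K):
            for r in range(1, counts[k] + 1):
                # Equation (4): success probability of the r-th Bernoulli draw.
                mu = alpha * tau[k] / (r - 1 + alpha * tau[k])
                if rng.random() < mu:
                    m[k] += 1                 # accumulates m_k = sum_{j,r} m_jkr
    # Equation (5): Dirichlet draw with parameters [m_1, ..., m_K, gamma],
    # built from Gamma variates; a zero parameter yields a zero component.
    params = m + [gamma]
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in params]
    total = sum(draws)
    return [d / total for d in draws], m
```

Because the first draw (r = 1) succeeds with probability one, every topic with at least one assigned word keeps a positive Dirichlet parameter, while the last component always retains the mass governed by `gamma`.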

3 Discussion 

In this work, we transformed the LDA-based Author-Topic model into a
non-parametric model that estimates the number of components necessary for
representing the data. It remains necessary to empirically evaluate the
performance (i.e. perplexity) of the proposed model on benchmark data sets.
While we chose the Author-Topic model as an example for such a
transformation, we believe that many of the considerations made hold equally
for a wider range of models and can serve as a blueprint for a simple
application of non-parametric Bayesian priors.